A Description Language for Syntactically Annotated Corpora

Authors

  • Esther König
  • Wolfgang Lezius
Abstract

This paper introduces a description language for syntactically annotated corpora which allows for encoding both the syntactic annotation of a corpus and the queries to a syntactically annotated corpus. In terms of descriptive adequacy and computational efficiency, the description language is a compromise between script-like corpus query languages and high-level, typed unification-based grammar formalisms.

1 Introduction

Syntactically annotated corpora like the Penn Treebank (Marcus et al., 1993), the NeGra corpus (Skut et al., 1998) or the statistically disambiguated parses in (Bell et al., 1999) provide a wealth of information, which can only be exploited with an adequate query language. For example, one might want to retrieve verbs with their sentential complements, or specific fronting or extraposition phenomena. So far, queries to a treebank have been formulated in scripting languages like tgrep, Perl or others. Recently, some powerful query languages have been developed: an example of a high-level, constraint-based language is described in (Duchier and Niehren, 1999). (Bird et al., 2000) propose a query language for the general concept of annotation graphs. A graphical query notation for trees is under development in the ICE project (UCL, 2000).

In the current paper, we present a proposal for a graph description language which is meant to fulfill two conflicting requirements: On the one hand, the language should be close to traditional linguistic description languages, i.e. to grammar formalisms, as a basis for modular, understandable code, even for complex corpus queries. On the other hand, the language should not preclude efficient query evaluation. Our answer is to profit from the research on typed, feature-based/constraint-based grammar formalisms (e.g. (Carpenter, 1992), (Copestake, 1999), (Dörre and Dorna, 1993), (Dörre et al., 1996), (Emele and Zajac, 1990), (Höhfeld and Smolka, 1988)), and to pick those ingredients which are known to be computationally 'tractable' in some sense.

2 The Query Language

2.1 The right kind of graphs

If syntactic analysis is meant to provide a basis for semantic interpretation, the predicate-argument structure of a sentence must be recoverable from its syntactic analysis. Nonlocal dependencies like topicalization and right extraposition tell us that trees are not expressive enough. We need a way to connect an extraposed constituent with its syntactic or semantic head. This can be done either by introducing empty leaf nodes plus a means for node coreference (as in the Penn Treebank) or by admitting crossing edges. In our project, the latter solution has been chosen (Skut et al., 1997), partly for the reason that it is simpler to annotate (no decision on the right place of a trace has to be taken). We call this extension of trees with crossing edges syntax graphs. An example is shown in Fig. 1. In order to discuss the details of the language, we will make reference to the simpler syntax graph in Fig. 2.
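As a rough illustration of what "trees with crossing edges" means, the following Python sketch represents a constituent structure whose leaves carry token positions and checks whether a node dominates a non-contiguous span of tokens, which is the situation in which its edges cross those of other nodes when drawn over the sentence. The class `Node`, the functions `yield_positions` and `is_discontinuous`, the example sentence and the category labels are all hypothetical and chosen for illustration; they are not the paper's description language or the NeGra annotation scheme.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node of a hypothetical syntax graph (illustrative, not the paper's format)."""
    label: str                       # category for inner nodes, word form for leaves
    position: Optional[int] = None   # token index for leaves, None for inner nodes
    children: List["Node"] = field(default_factory=list)


def yield_positions(node: Node) -> List[int]:
    """Return the sorted token positions dominated by `node`."""
    if node.position is not None:
        return [node.position]
    positions: List[int] = []
    for child in node.children:
        positions.extend(yield_positions(child))
    return sorted(positions)


def is_discontinuous(node: Node) -> bool:
    """True if the tokens dominated by `node` leave a gap in the sentence."""
    pos = yield_positions(node)
    return any(b - a > 1 for a, b in zip(pos, pos[1:]))


# Assumed example with an extraposed relative clause:
# "A(0) man(1) arrived(2) who(3) I(4) know(5)"
rel = Node("S-rel", children=[Node("who", 3), Node("I", 4), Node("know", 5)])
np = Node("NP", children=[Node("A", 0), Node("man", 1), rel])
root = Node("S", children=[np, Node("arrived", 2)])

print(is_discontinuous(np))    # the NP skips token 2
print(is_discontinuous(root))  # the root covers tokens 0..5 without a gap
```

In this sketch the extraposed clause is attached directly to the noun phrase it modifies, so no trace node or coreference index is needed; the price is that the noun phrase no longer covers a contiguous stretch of the sentence.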

Similar resources

Syntactically annotated corpora of Estonian

Syntactically annotated corpora are needed 1) to train and test parsers and various language technology products (grammar checkers, information retrievers and extractors, machine translators, etc.); 2) to check the agreement of existing linguistic theories with real language usage. The corpora can be annotated on different levels of depth. In shallow syntactically annotated corpora a syntact...

Finite Structure Query: A Tool for Querying Syntactically Annotated Corpora

Finite structure query (fsq for short) is a tool for querying syntactically annotated corpora. fsq employs a query language of high expressive power, namely full first order logic. It can be used to query arbitrary finite structures, not just trees.

Morphologically and Syntactically Annotated Corpora of Many Languages

Annotated corpora have become a standard resource for research in both linguistics and computational processing of natural languages. Lexicographers judge word usage and distribution by occurrences in corpora; part-of-speech tags may help them narrow their queries. Grammarians may use syntactically annotated corpora (treebanks) for queries such as “show me all examples where a verb governs two ...

Mining Syntactically Annotated Corpora with XQuery

This paper presents a uniform approach to data extraction from syntactically annotated corpora encoded in XML. XQuery, which incorporates XPath, has been designed as a query language for XML. The combination of XPath and XQuery offers flexibility and expressive power, while corpus specific functions can be added to reduce the complexity of individual extraction tasks. We illustrate our approach...

VIQTORYA -- A Visual Query Tool for Syntactically Annotated Corpora

This paper presents a query tool for syntactically annotated corpora. The query tool was developed to search the Tübingen Treebanks annotated at the University of Tübingen. In principle, however, it can also be adapted to other corpora. The tool uses a query language that allows searching for tokens, syntactic categories, grammatical functions and binary relations of (immediate) dominance and lin...


Publication date: 2000